Introducing the First Self-Supervised Algorithm for Speech, Vision and Text
Self-supervised learning algorithms for images, speech, text and other modalities work in very different ways, which has limited how broadly researchers can apply them. Because an algorithm designed for understanding images can't be directly applied to reading text, it's difficult to advance several modalities at the same rate. With data2vec, we've developed a unified way for models to predict their own representations of the input data, regardless of whether that input is speech, text or images. By focusing on these representations, a single algorithm can work with completely different types of input.
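To make the idea of a model predicting its own representations more concrete, here is a minimal NumPy sketch. It is not the data2vec implementation. It assumes a teacher-student setup in which a student sees a partially masked input and regresses the representations a teacher produces for the full input, with the teacher kept as an exponential moving average of the student. The one-layer `encode` function, the `train_step` helper, the masking rate, and the randomly generated "speech", "vision" and "text" arrays are all hypothetical stand-ins chosen for illustration.

```python
import numpy as np

def encode(weights, x):
    """Toy encoder (assumption): add mean-pooled context to each token, then project.

    x has shape (num_tokens, dim); returns (encoder_input, representation).
    """
    ctx = x + x.mean(axis=0, keepdims=True)
    return ctx, np.tanh(ctx @ weights)

def train_step(student_w, teacher_w, x, mask, lr=0.05, ema_decay=0.99):
    """One self-distillation step on a single example.

    mask is a boolean array of shape (num_tokens,); True marks hidden tokens.
    """
    # Teacher sees the full input; its representations are the targets.
    _, target = encode(teacher_w, x)

    # Student sees the input with masked tokens zeroed out.
    x_masked = np.where(mask[:, None], 0.0, x)
    ctx, pred = encode(student_w, x_masked)

    # Regress the teacher's representations at the masked positions only.
    diff = (pred - target)[mask]
    loss = float(np.mean(diff ** 2))

    # Manual gradient of the mean squared error through tanh.
    grad_z = 2.0 * diff * (1.0 - pred[mask] ** 2) / diff.size
    grad_w = ctx[mask].T @ grad_z
    new_student = student_w - lr * grad_w

    # Teacher tracks the student as an exponential moving average.
    new_teacher = ema_decay * teacher_w + (1.0 - ema_decay) * new_student
    return new_student, new_teacher, loss

rng = np.random.default_rng(0)
dim = 8
student = rng.normal(scale=0.5, size=(dim, dim))
teacher = student.copy()

# Hypothetical pre-embedded inputs: the step does not care which modality produced them.
inputs = {
    "speech": rng.normal(size=(50, dim)),  # stand-in for audio frames
    "vision": rng.normal(size=(16, dim)),  # stand-in for image patches
    "text": rng.normal(size=(12, dim)),    # stand-in for text tokens
}
losses = {}
for name, x in inputs.items():
    mask = rng.random(x.shape[0]) < 0.5
    mask[0] = True  # ensure at least one token is masked
    student, teacher, losses[name] = train_step(student, teacher, x, mask)
```

The point of the sketch is that `train_step` never inspects what kind of data it was given: the target is always a representation, so the same loop runs unchanged on all three arrays. A real system would still need a modality-specific step to turn raw audio, pixels or characters into token embeddings.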